Combining POS Taggers for Improved Accuracy to Create Telugu Annotated Texts for Information Retrieval

Authors

  • Rama Sree
  • Kusuma Kumari
Abstract

POS tagging is the process of assigning the correct POS tag (noun, verb, adjective, adverb, or other lexical category marker) to each word of a sentence. POS taggers are developed by modeling the morpho-syntactic structure of natural language text. We attempt to improve the accuracy of existing Telugu POS taggers by using a voting algorithm. Three Telugu POS taggers, namely (1) a rule-based POS tagger, (2) a Brill tagger, and (3) a Maximum Entropy POS tagger, were developed with accuracies of 98.016%, 92.146%, and 87.818%, respectively. An annotated corpus of 12,000 words was used to train the last two taggers. An error analysis identifies the errors made by these three taggers, and methods to improve their accuracy are then examined. As a first step, a voting algorithm is proposed to build an ensemble Telugu POS tagger that gives better results. The tagged output could be used for a variety of natural language processing (NLP) applications, mainly word sense disambiguation (WSD) in retrieving Telugu documents.
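The abstract does not spell out the voting scheme, so the following is only a minimal sketch of one plausible variant: per-token majority voting across the three taggers, with ties broken in favour of the tagger with the highest reported accuracy. The function name `vote`, the tagger keys, the tie-break rule, and the example tags are all assumptions made for illustration, not the authors' implementation or tagset.

```python
from collections import Counter
from typing import Dict, List

# Accuracies reported in the abstract. Using them as a tie-break
# priority is an assumption of this sketch, not the paper's method.
ACCURACY = {"rule_based": 98.016, "brill": 92.146, "maxent": 87.818}


def vote(tags_by_tagger: Dict[str, List[str]]) -> List[str]:
    """Combine per-token tags from several taggers by majority vote.

    tags_by_tagger maps a (hypothetical) tagger name to its tag sequence
    for one sentence. All sequences must have the same length.
    """
    lengths = {len(tags) for tags in tags_by_tagger.values()}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    n_tokens = lengths.pop()

    # Taggers ordered from most to least accurate, for tie-breaking.
    ranked = sorted(
        tags_by_tagger,
        key=lambda name: ACCURACY.get(name, 0.0),
        reverse=True,
    )

    combined = []
    for i in range(n_tokens):
        counts = Counter(tags_by_tagger[name][i] for name in ranked)
        best = max(counts.values())
        # Among the tags with the highest vote count, keep the one
        # proposed by the most accurate tagger.
        for name in ranked:
            tag = tags_by_tagger[name][i]
            if counts[tag] == best:
                combined.append(tag)
                break
    return combined


# Illustrative tags only; the real Telugu tagset is not given in the abstract.
example = {
    "rule_based": ["NN", "VB", "JJ"],
    "brill": ["NN", "NN", "RB"],
    "maxent": ["VB", "NN", "PRP"],
}
print(vote(example))
```

In this example the first token is decided by a two-to-one majority, the second token shows the two weaker taggers outvoting the rule-based one, and the third token is a three-way tie that falls back to the rule-based tagger's choice.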

Similar resources

Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication

We assess the performance of off-the-shelf POS taggers when applied to two types of Internet texts in German, and investigate easy-to-implement methods to improve tagger performance. Our main findings are that extending a standard training set with small amounts of manually annotated data for Internet texts leads to a substantial improvement of tagger performance, which can be further improved...

Parts Of Speech Tagging for Indian Languages: A Literature Survey

Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag or other lexical class marker to every word in a sentence. In many Natural Language Processing applications such as word sense disambiguation, information retrieval, information processing, parsing, question answering, and machine translation, POS tagging is considered one of the basic necessary tool...

A Comparison of Three Machine Learning Methods for Amazigh POS Tagging

Part-of-speech tagging (POS tagging) has a crucial role in different fields of natural language processing (NLP) including Speech Recognition, Natural Language Parsing, Information Retrieval and Multi Words Term Extraction. This paper describes a set of experiments involving the application of three state-of-the-art part-of-speech taggers to Amazigh texts, using a tagset of 28 tags. The taggers...

A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam

Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest has organized this challenge as a shared task for the second consecutive year to improve the state-of-the-art. This paper describes the POS tagger built at Surukam...

Web-Based Bengali News Corpus for Lexicon Development and POS Tagging

Lexicon development and Part of Speech (POS) tagging are very important for almost all Natural Language Processing (NLP) applications. The rapid development of these resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpus. We have used a Bengali news corpus, developed from the web archive of a widely read Bengali newspaper. The ...

Publication date: 2008